#!/usr/bin/perl

use strict;
use warnings;

use FindBin;

=head1 SECTION 3.2 BENCHES

This section presents the common parts of the benches for section 3.2.

=over

=item

B<Core technical details>

    The actual measurements are done by the main_speed.c files. They
    use
    [_rdtsc](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=rdtsc&expand=4512)
    to measure cycles accurately.  They run NB_LOOP (default 100000)
    encryptions of a BUFF_SIZE-byte (default 4096) buffer.

    All the benchmarks output .tex files, which contain macros
    defining the numbers reported in the paper.

    Note that when an optimization decreases the code size instead of
    increasing it (as is the case, for instance, when inlining DES),
    the generated macro contains a negative number. However, we
    reformatted those numbers to make them positive in the paper
    (thus allowing us to say something like "... while actually
    reducing code size by xx%", which wouldn't have been possible
    with negative numbers).

    The results are in the "results" folders of each bench.
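
    As a rough illustration of the sign convention described above,
    the following Perl sketch computes a code-size change as a
    percentage and formats it as a LaTeX macro. The helper names
    (size_change_percent, tex_macro), the \newcommand form, the macro
    name and the one-decimal rounding are assumptions made for this
    example; they are not taken from the actual run.pl scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: relative code-size change in percent.
# The result is negative when the optimization shrinks the binary.
sub size_change_percent {
    my ($size_before, $size_after) = @_;
    return ($size_after - $size_before) / $size_before * 100;
}

# Hypothetical helper: one macro definition per reported number.
# The \newcommand form is an assumption about the .tex layout.
sub tex_macro {
    my ($name, $value) = @_;
    return sprintf("\\newcommand{\\%s}{%.1f}", $name, $value);
}

my $change = size_change_percent(1000, 900);            # made-up sizes: the binary shrank
print tex_macro('ExampleCode', $change), "\n";          # the macro keeps the sign
printf "reducing code size by %.1f%%\n", abs($change);  # positive wording for the paper
```

    The macro keeps the raw (possibly negative) value, and only the
    wording in the paper uses the absolute value.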

=item

B<Customization>

    The benches are run $NB_LOOP times (defined at the beginning of
    run.pl, defaults to 20). Each of them usually takes up to a
    few minutes to run.

    The script has 3 parts, controllable through command-line flags:
     * -g: regenerate the C code from the Usuba sources (not by default).
     * -c: recompile the C binaries (default).
     * -r: run the benchmark (default).
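
    A minimal sketch of how such flags could be handled with
    Getopt::Long is given below. The helper name select_parts and the
    rule "if no flag is given, compile and run" are assumptions made
    for this example; how the real run.pl combines explicit flags with
    its defaults is not described here.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: decide which of the three parts to execute.
sub select_parts {
    my @args = @_;
    my ($gen, $compile, $run) = (0, 0, 0);
    GetOptionsFromArray(\@args, 'g' => \$gen, 'c' => \$compile, 'r' => \$run)
        or die "Usage: run.pl [-g] [-c] [-r]\n";
    # Assumption: without any flag, fall back to the documented defaults.
    ($compile, $run) = (1, 1) unless $gen || $compile || $run;
    return { generate => $gen, compile => $compile, run => $run };
}

my $parts = select_parts('-g', '-c');
print join(' ', map { "$_=$parts->{$_}" } sort keys %$parts), "\n";
```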

=back

=cut
    






=head2 INTERLEAVING

=over

=item 

B<Paper>: section 3.2 Back-end, paragraph Interleaving

=item 

B<bench directory>: bench/interleaving

=item 

B<bench run script>: bench/interleaving/run.pl.  


=item 

B<High-level description>

    This bench runs interleaved and non-interleaved versions of
    Serpent and Rectangle generated by Usuba, and generates the file
    results/interleaving.tex, which contains the speedup and binary
    size increase due to interleaving.
    It also generates results/{serpent.txt,rectangle.txt}, which
    contain the details of each measurement (those numbers are also
    printed on stdout).
    

=item

B<Specific details>

    The C code generated by Usuba is already present in the
    directory (the files named xxx_ua.c). It can however be
    regenerated using the -g flag.  The other C files are
    runtime support code (stream.c, *.h, serpent.c, key.c, ...).

=back

=cut

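# Abort if a chdir fails: otherwise ./run.pl would be looked up in the wrong directory.
chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "interleaving" or die "Cannot chdir to interleaving: $!";
system "./run.pl", @ARGV;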








=head2 SCHEDULING

=over

=item 

B<Paper>: section 3.2 Back-end, paragraph Scheduling bitsliced code

=item 

B<bench directory>: bench/scheduling-bs

=item 

B<bench run script>: bench/scheduling-bs/run.pl.  


=item 

B<High-level description>

    This bench compares the code generated by Usuba for bitsliced DES
    and AES, with and without scheduling.
    It generates the file results/scheduling-bs.tex, which contains
    the macros \SchedulingBitslice***Speedup (speedup gained by
    scheduling on cipher ***) and \SchedulingBitslice***Code (code
    size increase/decrease due to scheduling on cipher ***). Those
    numbers appear in the paper right after "On bitsliced DES, scheduling
    after inlining increases throughput by...".

    
=item

B<Specific details>

    What to look at in run.pl:
    lines 57-58: generate C code from Usuba code without (-no-sched)
    and with (no options; scheduling is done by default) scheduling.
    lines 68-69: compile the binaries for the bench (with and without
    scheduling).
    lines 85-89: run the benches and store the results.
    lines 95-104: print the measurements to stdout (and to the .txt files).
    lines 106-113: compute the speedups/sizes.
    lines 117-end: print the results.
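
    The step that computes the speedups could look roughly like the
    sketch below. The formula (ratio of the mean cycle counts, minus
    one, as a percentage), the helper names and the sample numbers are
    assumptions for illustration; run.pl may compute its numbers
    differently.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean of a non-empty list of measurements.
sub mean { return sum(@_) / @_; }

# Hypothetical helper: speedup in percent, computed from cycle counts
# (a lower cycle count means faster code).
sub speedup_percent {
    my ($cycles_base, $cycles_opt) = @_;   # array refs, one entry per run
    return (mean(@$cycles_base) / mean(@$cycles_opt) - 1) * 100;
}

my @nosched = (120, 118, 122);   # made-up cycle counts, without scheduling
my @sched   = (100,  99, 101);   # made-up cycle counts, with scheduling
printf "speedup: %.2f%%\n", speedup_percent(\@nosched, \@sched);
```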

=back

=cut
    
chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "scheduling-bs" or die "Cannot chdir to scheduling-bs: $!";
system "./run.pl", @ARGV;







=head2 INLINING

=over

=item 

B<Paper>: section 3.2 Back-end, paragraphs Inlining, Scheduling bitsliced code and Scheduling m-sliced code

=item 

B<bench directory>: bench/inlining


=item 

B<bench run script>: bench/inlining/run.pl.  


=item 

B<High-level description>

    This bench generates the files results/inlining-nosched.tex and
    results/inlining-sched.tex, which contain the following macros:
    * inlining-nosched.tex: (bitsliced ciphers only, no scheduling)
      - \InliningNosched***Speedup: the speedup offered by inlining in cipher ***
      - \InliningNosched***Code: the increase/decrease in code size caused by inlining
    * inlining-sched.tex:
      This might be confusing, but the meaning of Scheduling in this file depends on
      the cipher (the slicing type actually): for AES and DES (bitsliced), it is 
      bitslice scheduling, whereas for Chacha20 and AES H-sliced, it is m-slice 
      scheduling.
      - \InliningScheduling***Speedup: the speedup offered by inlining and scheduling
      - \InliningScheduling***Code: the increase/decrease in code size caused by inlining

    More specifically, the numbers of the paragraph Inlining are
    \InliningNoschedDESSpeedup (44.8), \InliningNoschedDESCode (9.1),
    \InliningNoschedAESSpeedup (24.24), and \InliningNoschedAESCode
    (24.8) (in that order). 
    Paragraph Scheduling bitsliced code: only the last two numbers
    ("Overall, combining inlining and scheduling results in a net..")
    are generated by this bench (the previous four numbers are
    generated by bench/scheduling-bs/run.pl):
    \InliningSchedulingDESSpeedup (45.8) and \InliningSchedulingAESSpeedup
    (26.22).
    Paragraph Scheduling m-sliced code: only the first two numbers are
    generated by this bench ("This scheduling algorithm increased the
    throughput of ..."): \InliningSchedulingHAESSpeedup (2.43) and
    \InliningSchedulingChachaSpeedup (9.09).

    The details of each measurement are available in results/*.txt,
    and are printed to stdout during the bench, though you shouldn't
    need them.

=item

B<Specific details>

    I'll quickly walk you through the script bench/inlining/run.pl so
    you can understand what happens.

    lines 32-44: ciphers for the bench, and their slicing types.

    C file generation:
    line 61: $sched_opt is either -no-sched (i.e., don't perform scheduling),
    or '-sched-n 10' (i.e., perform scheduling, and set the lookahead window to
    10 for the m-slice scheduling).
    lines 65-66: compile the Usuba code, either without inlining (-no-inline),
    or with inlining (-inline-all).

    C file compilation:
    lines 75-80: compile each cipher for each combination of scheduling/inlining

    Benchmark run:
    lines 95-98: run each binary for each cipher (l.88, for my $cipher), each
    scheduling (l.87, for my $sched), and each inlining (l.94, for my $inline),
    and store the results inside the hash %res.
    lines 104-113: generate the .txt files (which contain the details of the
    measurements).
    lines 115-122: compute the speedups and code size ratios.
    lines 126-end: generate the .tex files, containing the numbers computed above.
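
    The loop structure described for the benchmark run can be pictured
    as follows. This is only a sketch: the cipher list, the binary
    naming scheme and the run_binary stub (which returns a fake number
    instead of launching anything) are placeholders, not the contents
    of bench/inlining/run.pl.

```perl
use strict;
use warnings;

my @ciphers = qw(des aes);             # placeholder cipher names
my @scheds  = qw(nosched sched);       # without / with scheduling
my @inlines = qw(noinline inline);     # without / with inlining

# Stub standing in for the execution of a compiled benchmark binary.
sub run_binary {
    my ($binary) = @_;
    return length $binary;             # fake measurement
}

# One measurement per (cipher, scheduling, inlining) combination.
my %res;
for my $sched (@scheds) {
    for my $cipher (@ciphers) {
        for my $inline (@inlines) {
            my $binary = "$cipher-$sched-$inline";   # hypothetical naming scheme
            $res{$cipher}{$sched}{$inline} = run_binary($binary);
        }
    }
}

printf "%d ciphers measured\n", scalar keys %res;
```

    Keying the hash by cipher first makes it easy to compare, for one
    cipher, the four scheduling/inlining combinations afterwards.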

=back

=cut

    
chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "inlining" or die "Cannot chdir to inlining: $!";
system "./run.pl", @ARGV;






=head2 UNROLLING

=over

=item 

B<Paper>: section 3.2 Back-end, paragraph Scheduling m-sliced code

=item 

B<bench directory>: bench/unrolling

=item 

B<bench run script>: bench/unrolling/run.pl.  


=item 

B<High-level description>

    This bench compares the code generated by Usuba for m-sliced AES
    and Chacha20, with and without unrolling.
    It generates the file results/unrolling.tex, which contains the
    macros \Unrolling***Speedup (speedup gained by unrolling cipher
    ***) and \Unrolling***Code (code size increase/decrease due to
    unrolling cipher ***). Those numbers appear in the paper at the end
    of the paragraph Scheduling m-sliced code, right after "On AES
    (resp. Chacha20), this yields a ...".

    
=item

B<Specific details>

    What to look at in run.pl:
    lines 60-61: generate C code from Usuba code without (no options;
    unrolling is not done by default) and with (-unroll) unrolling.
    Note that scheduling and inlining are enabled (line 56).
    lines 71-72: compile the binaries for the bench (with and without
    unrolling).
    lines 88-94: run the benches and store the results.
    lines 98-107: print the measurements to stdout (and to the .txt files).
    lines 109-116: compute the speedups/sizes.
    lines 119-end: print the results.

=back

=cut
    
chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "unrolling" or die "Cannot chdir to unrolling: $!";
system "./run.pl", @ARGV;






=head1 TABLE 3 BENCHES

=over

=item

B<Paper>: Table 3. Comparison between Usuba code & reference implementations

=item 

B<bench directory>: bench/ua-vs-human

=item 

B<bench run script>: bench/ua-vs-human/run.pl.

=item 

B<High-level description>

    This benchmark generates the file human.tex, which defines macros containing
    the throughput, code sizes, speedup and latency displayed in Table 3.
    
    It uses three external sources to run the benchmarks, and only
    gathers the results afterward.

B<External sources>: (you should understand "external source" as
"script" (but not exactly))

    1. The benchmark for DES can be found in ciphers/des/compile.pl, and
    is very similar to the scripts running the benchmarks for section
    3.2 (see above).

    2. The benchmark for Rectangle is run by
    ciphers/rectangle/bench.pl. Its structure is different from the
    other benchmarks, because it relies on C++ code (the code provided
    by the authors of Rectangle). The folder ciphers/rectangle
    contains various implementations of Rectangle generated by Usuba,
    whose names describe their content:
      - avx/sse/gp: the architecture
      - bitslice/nslice/vector: the slicing (nslice stands for
        Hslicing, and vector for Vslicing).
      - inter: if present, the implementation uses interleaving
      - inline: if present, was compiled with -inline-all
    You shouldn't try to recompile those Usuba files, as the resulting
    C files have been manually modified to be compatible with
    C++ (which just meant changing the types). Another bench
    (Monomorphization, see below) allows you to recompile the Usuba
    files for Rectangle, and exhibits throughputs similar to the
    ones from this benchmark.
    You might want to have a look at ciphers/rectangle/main.cpp,
    function `speed` (line 72), which contains the code that takes care
    of the measurements. It is very similar to the code from the
    benchmarks of section 3.2 (files main_speed.c): it uses _rdtsc()
    to measure how many cycles it takes to encrypt NB_LOOP times an
    IN_SIZE-byte buffer. It runs both the reference (line 85) and the
    Usuba (line 96) code for a given architecture. Note that the
    architecture isn't mentioned in the source of main.cpp, but it
    appears in types.h (lines 57, 59 and 61), and in the
    stream_xxx.cpp files (which are the runtimes for the Usuba code).

    3. The benchmarks for AES (H-sliced), Serpent (V-sliced) and
    Chacha20 (V-sliced) rely on Supercop. Be advised that running
    Supercop will take a while (probably a few hours). Supercop
    produces files containing the cycles needed to encrypt a buffer of
    a given size (see for instance
    '/supercop-data/hostname/amd64/try/c/clang_-march=native_-O3_-fomit-frame-pointer_-fwrapv_-std=gnu11/crypto_stream/chacha20/usuba-avx-fast/data').


B<Collecting the results>. (this is actually the content of bench/ua-vs-human/run.pl).


    1. DES (lines 47-74): the results are already nicely formatted in
    ciphers/des/results.txt, which was generated by
    ciphers/des/compile.pl. We only look at the lines
    containing "-std", since Kwan's implementation was written for
    general-purpose registers.
    To the code size of Usuba, we add the size of the circuits for the
    Sboxes, which can be found in 'data/sboxes/des_*.ua' (and total
    about 500 lines).

    2. Rectangle (lines 307-352): The results are in
    ciphers/rectangle/results.txt (generated by
    ciphers/rectangle/bench.pl). We only need to parse it, and
    format the results.

    3. Supercop (lines 77 to 300). For each cipher and architecture,
    the variables $ref_file and $ua_file contain the filename of the
    best results for both usuba and the reference implementation. The
    "interesting" bits of the filename are:
      - crypto_stream/.... : shows which implementation is being considered
      - try/c/....-fomit : shows which compiler / compiler flags are being used.
    The function get_speed_supercop (lines 361-375) finds the lines
    containing "xor_cycles 4096" in the data file (which means that a
    4096-byte buffer was encrypted), and computes the average of the
    numbers that follow on those lines, eliminating the lower values
    (which are almost always due to context switches or other
    OS-related issues).
    Counting the lines of code is not automated in most cases, because
    reference implementations tend to fuse runtime and primitive, or
    include unrelated code (like decryption) in the same file as their
    primitive, so a simple `cloc` would count more than just the
    primitive. We manually counted the lines of code and hardcoded the
    numbers, with some details when possible (see lines 88-92 for
    instance). Both the Usuba and reference source code can be found
    in supercop/crypto_stream/xxx.
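
    To give an idea of the parsing done by get_speed_supercop, here is
    a simplified sketch. It is not the actual function: the name
    average_xor_cycles, the assumed line layout (whitespace-separated
    integers after "xor_cycles 4096") and the sample lines are
    illustrative assumptions. It also leaves out the elimination of
    outlier values, since the exact rule is not described here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sketch: average the cycle counts listed after "xor_cycles 4096".
# The line layout is an assumption about the Supercop data file.
sub average_xor_cycles {
    my (@lines) = @_;
    my @cycles;
    for my $line (@lines) {
        next unless $line =~ /xor_cycles 4096\s+(.*)/;
        my $rest = $1;
        # Keep only the fields that are plain integers.
        push @cycles, grep { /^\d+$/ } split ' ', $rest;
    }
    return undef unless @cycles;        # no matching line found
    return sum(@cycles) / @cycles;
}

my @sample = (
    'prefix xor_cycles 4096 4000 4100 4200',   # made-up sample lines
    'prefix xor_cycles 64 300 310',            # other buffer size: ignored
);
print average_xor_cycles(@sample), "\n";
```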

=back


=cut

chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "ua-vs-human" or die "Cannot chdir to ua-vs-human: $!";
system "./run.pl", @ARGV;










=head1 FIGURE 3 BENCHES

=over

=item

B<Paper>: Figure 3. Scalability of SIMD compilation

=item 

B<bench directory>: bench/scaling-avx512

=item 

B<bench run script>: bench/scaling-avx512/run.pl

=item 

B<High-level description>

    This benchmark generates the file plot/speedup.pdf, which is
    Figure 3 in the paper.
    Raw data can be found in plot/data-speedup.dat.

    If you just want to generate speedup.pdf for measurements that
    have already been done, you can simply run
    `./scaling-avx512/run.pl -l`; this should be really fast.


B<Technical details>

    Each folder (except 'plot') of bench/scaling-avx512 contains a
    generic (in the sense that it can be run on different
    architectures) runtime for a given cipher.
    A script 'compile.pl' in each directory takes care of generating
    the C files, compiling them and running the benchmark for the
    cipher. If your computer doesn't have AVX512 registers available,
    it shouldn't be an issue: you'll just get a bunch of 0s instead
    of the AVX512 measurements. These scripts are quite similar to
    those for section 3.2, so we won't go into much detail on how they work.

    The script scaling-avx512/run.pl takes care of calling each of these
    compile.pl scripts, then gathers the results, normalizes them
    (in order to always have "speed SSE = 1", with the other
    speeds relative to SSE), and finally generates Figure 3.
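
    The normalization step can be sketched as follows. The helper name
    normalize_to_sse, the hash layout and the sample speeds are
    assumptions made for this example, not the code of
    scaling-avx512/run.pl.

```perl
use strict;
use warnings;

# Sketch: express each speed relative to the SSE speed of the same cipher.
sub normalize_to_sse {
    my ($speeds) = @_;                 # hash ref: architecture name to speed
    my $sse = $speeds->{SSE}
        or die "missing or null SSE measurement\n";
    my %norm = map { ($_ => $speeds->{$_} / $sse) } keys %$speeds;
    return \%norm;
}

# Made-up speeds; AVX512 is 0, as when the registers are unavailable.
my $norm = normalize_to_sse({ SSE => 2.0, AVX2 => 4.0, AVX512 => 0 });
printf "%-6s %.2f\n", $_, $norm->{$_} for sort keys %$norm;
```

    A missing AVX512 measurement stays at 0 after normalization, so it
    remains easy to spot in the plot data.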


=back

=cut

chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "scaling-avx512" or die "Cannot chdir to scaling-avx512: $!";
system "./run.pl", @ARGV;








=head1 FIGURE 4 BENCHES

=over

=item

B<Paper>: Figure 4. Monomorphizations of Rectangle

=item 

B<bench directory>: bench/rectangle

=item 

B<bench run script>: bench/rectangle/run.pl

=item 

B<High-level description>

    This benchmark generates the file plot/slicing-compare.pdf, which is
    Figure 4 in the paper.
    Raw data can be found in plot/*.dat, and plot/results.txt.


B<Technical details>

    This bench works much like the benches from section 3.2.
    It first compiles the Usuba files (if the -g option was
    supplied) with each slicing (-B for bitslice, -V for vslice, -H
    for hslice). It then compiles the C benches, and finally runs the
    benchmarks.
    You might want to have a look at main.c, which does the actual
    measurements.


=back

=cut

chdir $FindBin::Bin or die "Cannot chdir to the script directory: $!";
chdir "rectangle" or die "Cannot chdir to rectangle: $!";
system "./run.pl", @ARGV;
